EARLY DATA FROM A REAL-TIME COCHLEA


By Victor W. Bolie


Background 

Automatic coding of speech sounds in real time, with minimum
dependence on voice pitch and individual speaker characteristics,
is a key requirement in the coming development of verbally
responsive machines. Ordinary telephone conversation using phonemes
and words in scrambled order reveals sound overlaps (three, free),
spellings (one, won), and meanings (train ≠ train), which have
been tolerated for generations. Pending the evolution of
intelligently tolerant machines, some sort of restricted language will
need to be developed for early applications, e.g., computers
programmed by voice instruction.

It goes without saying that no algorithm for word recognition
or sentence interpretation should be burdened with separating
phonemes that can be well isolated and identified by immediate
"front-end" conversion of the speech signal. The artificial ear
used in the research reported here (and elsewhere; see Tables 1
and 2) has appeal from the viewpoint of "naturalness." In
particular, it has been anticipated that the 6-millisecond propagation
time of the basilar membrane is important in the identification of
plosive bursts (and in the separation of male and female voices).

Hardware/Software Problems

The system previously assembled for the study of prolongable
phonemes proved adequate for that purpose (see Appendix A). From
a reading of the HP instruction manuals and I/O specifications
available at that time, it was expected that with the acquisition of
the HP "burst read" tape, the same system could be used to collect
data on the plosive transients. A major delay in the research
resulted when it was found that the burst-read mode is actually
limited to an uninterrupted string of only 255 data bytes. This
restricted the system capability to a time-segment capture of
only 16 milliseconds of fine-structure data, followed by a software
gap of at least 40 milliseconds to initiate the next burst.

The unavoidable restructuring of the I/O part of the system
was accomplished (on a hurried prototype basis) by detailed
assembly of 3072 bytes of external memory, inserted between the
sampler/ADC unit and the I/O service unit. The result of this
effort was an extension of the high-speed read capability to an
uninterrupted time segment of 1300 milliseconds.
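
The timing figures quoted above can be put side by side with a little arithmetic. The following is only an illustrative sketch in Python (not the HP-9830 software used in the study); the function and constant names are hypothetical, and the only inputs are the 16 ms capture, the 40 ms minimum gap, and the 1300 ms buffered window stated in the text.

```python
# Figures quoted in the text; nothing here is measured independently.
BURST_CAPTURE_MS = 16.0       # fine-structure data captured per burst
BURST_GAP_MS = 40.0           # minimum software gap before the next burst
BUFFERED_WINDOW_MS = 1300.0   # uninterrupted window with the external memory


def coverage_fraction(capture_ms, gap_ms):
    """Fraction of elapsed time that is actually captured."""
    return capture_ms / (capture_ms + gap_ms)


def window_gain(new_window_ms, old_window_ms):
    """How many times longer the uninterrupted window has become."""
    return new_window_ms / old_window_ms


burst_coverage = coverage_fraction(BURST_CAPTURE_MS, BURST_GAP_MS)
gain = window_gain(BUFFERED_WINDOW_MS, BURST_CAPTURE_MS)

# The gap is "at least" 40 ms, so the coverage is an upper bound.
print(f"burst-read coverage: at most {burst_coverage:.1%} of elapsed time")
print(f"uninterrupted window: {gain:.2f} times longer with the buffer")
```

Under these figures the unmodified burst-read mode could observe no more than about two-sevenths of a transient-bearing utterance, which is why the external buffer was needed for plosive data.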

The other features of the HP-9830 (reliability, memory size,
flexibility, and low cost) still make it a good machine for
low-budget ASR research, especially if ways can be found to slow down
the input data rate without losing significant information carried
in the speech signal.

High-Speed  Results 

With the system modified as described above, a software
program was arranged so that data collection on a 1300-ms segment of
continuous speech could be initiated by merely pressing a "start"
key. The data train (stored first in the external memory at
real-time rate, and ingested later at computer-time rate) consisted of
64 consecutive sweeps of the 16 output taps of the cochlea,
giving a string of 1024 8-bit words. The output of each cochlear
tap was then plotted to visualize the voice-related changes in
the velocity profile of the basilar membrane vibrations.
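
As an illustration of how such a data train could be separated into one trace per tap before plotting, here is a minimal Python sketch. It assumes the 1024 bytes are ordered sweep by sweep, with the 16 taps in a fixed order inside each sweep and the first tap nearest the stapes; that ordering is an assumption, not something stated in the report. The name `split_into_traces` is hypothetical. The 2 mm tap spacing follows the distances (2, 4, ..., 32 mm) quoted for the figures.

```python
N_TAPS = 16          # output taps of the cochlea
N_SWEEPS = 64        # consecutive sweeps in one captured segment
TAP_SPACING_MM = 2   # taps at 2, 4, ..., 32 mm from the stapes


def split_into_traces(stream):
    """Return {distance_mm: [64 samples]} from a 1024-byte data train.

    Assumes sweep-major order: byte index = sweep * N_TAPS + tap.
    """
    if len(stream) != N_TAPS * N_SWEEPS:
        raise ValueError("expected %d bytes" % (N_TAPS * N_SWEEPS))
    traces = {}
    for tap in range(N_TAPS):
        distance_mm = (tap + 1) * TAP_SPACING_MM
        # Every 16th byte, starting at this tap's offset, is one trace.
        traces[distance_mm] = list(stream[tap::N_TAPS])
    return traces


# Synthetic stand-in for a captured segment (not real data).
demo_stream = bytes((s * N_TAPS + t) % 256
                    for s in range(N_SWEEPS) for t in range(N_TAPS))
demo_traces = split_into_traces(demo_stream)
print(sorted(demo_traces)[:4], len(demo_traces[2]))
```

Each resulting list holds the 64 successive samples of one tap and can be plotted against time directly.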

The speech used was that of a seasoned male voice speaking at
a normal rate with normally varying inflection. The results for the
eight syllabic sequences

"Automatic"

"Speech"

"Recognition"

"Computerized"

"Studies"

"Of Cochlear"

"Transformed"

"Phonemes"

are shown in Figures 1 through 8, respectively. In each
illustration, the rms velocity of the basilar membrane at a given distance
(2, 4, 6, ..., 32 millimeters) from the stapes is plotted as a
function of time.

As expected, the portion of the basilar membrane close to the
stapes (e.g., the 2 and 4 mm traces) responds most strongly to the
high-frequency (hiss-like) sound components. In most of the
records, there appear to be moderate degrees of correlation between
neighboring traces, which would tend to dispel the need for a
cochlea of more than 16 outputs.

Segments of momentary silence are particularly evident in
words like "speech," as demonstrated by Figure 2. Rise times
and decay times of the various traces also appear to be
significant. Further studies of these and other transients will be
required to extract additional features needed for automatic
speech recognition.
The Computer-Coupled Artificial Ear
and Some Preliminary Test Results
by V. W. Bolie

The system described elsewhere and summarized by Figures
A1 - A5 was tested using the speech sounds identified by Tables
A1 - A3. Twelve samples of each sound were captured in the data
acquisition and stored on the tape to give a total file of 12 x 21 =
252 cochlear-transformed sounds for training and challenge tests.

In loading these phonemes, the unvoiced sounds (4 out of 21) were
held steady, and for the voiced sounds (17 out of 21), the pitch
was varied in a sing-song manner. The 12 samples of each sound
were used to develop 21 reference vectors and 21 tolerance
vectors, which were stored as a condensate of the training.
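
The report does not say how the reference and tolerance vectors were derived from the 12 samples. One plausible reading, offered here only as an assumption, is an element-wise mean for the reference vector and an element-wise average absolute deviation for the tolerance vector. The Python sketch below illustrates that reading; the function name is hypothetical.

```python
def reference_and_tolerance(samples):
    """Condense equal-length sample vectors into two summary vectors.

    ASSUMPTION (not stated in the report):
      reference[j] = mean of element j over the samples
      tolerance[j] = mean absolute deviation of element j from that mean
    """
    n = len(samples)
    dim = len(samples[0])
    if any(len(s) != dim for s in samples):
        raise ValueError("all sample vectors must have the same length")
    reference = [sum(s[j] for s in samples) / n for j in range(dim)]
    tolerance = [sum(abs(s[j] - reference[j]) for s in samples) / n
                 for j in range(dim)]
    return reference, tolerance


# Toy example with two 2-element samples (real vectors would come from
# the 12 cochlear-transformed samples of one phoneme).
ref, tol = reference_and_tolerance([[1.0, 2.0], [3.0, 6.0]])
print(ref, tol)
```

Repeating this for each of the 21 sounds would give the 21 + 21 vectors mentioned above, under the stated assumption.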

All 252 phoneme samples were then submitted in sequence as
challenges to the recognition algorithm (2,3), and the resulting
252 response vectors were stored for later study of errors and
threats. Figure A6 shows the average response vector (a
horizontal row in the chart) for any given sound challenge.
Fortunately, the largest number in each row falls on the diagonal of
this 21 x 21 matrix. The greatest consistent threat appears to
be that of the "OU" against the "LL" sound, and the safest sounds
appear to be the unvoiced ones (SS, FF, KH, SH).

The 252 response vectors were analyzed further with respect
to recognition dangers. For this purpose a measure of hazard was
constructed, using the formula

\[ H(E,I) = \frac{B(E,E) + B(E,I)}{A(E,E) - A(E,I)} \quad \text{for all } I \neq E, \]

in which A(E,I) is the element in Row E and Column I of the
response-vector matrix shown in Figure A6, and in which B(E,I) is the
average deviation of the 12 contributions to that element. Each
row of the resulting "hazard matrix" was then searched to find
the greatest hazard value. The various phonemes were then ranked
in ascending order of this value. The results are listed in
Table A4, together with the actual recognition errors found from
a trivial search for the largest element in each of the 252
response vectors. As expected, the most errors occur where the
computed hazard values are greatest. For more detail, the nature of
the perception errors is listed in Table A5, where it is seen that
practically all of the perception errors are recoverable in the
"second-choice" responses.

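The hazard calculation and ranking described above can be sketched in a few lines of Python. This is an illustration only: it assumes the numerator of the hazard formula is the sum B(E,E) + B(E,I), the matrices A and B are small made-up examples rather than the 21 x 21 data of Figure A6, and the names `hazard_matrix` and `rank_by_worst_hazard` are hypothetical.

```python
def hazard_matrix(A, B):
    """H[e][i] = (B[e][e] + B[e][i]) / (A[e][e] - A[e][i]) for i != e.

    A: average response matrix; B: average deviations of its elements.
    Diagonal entries are left as None. A zero denominator (an off-diagonal
    response equal to the diagonal one) would raise ZeroDivisionError.
    """
    n = len(A)
    H = [[None] * n for _ in range(n)]
    for e in range(n):
        for i in range(n):
            if i != e:
                H[e][i] = (B[e][e] + B[e][i]) / (A[e][e] - A[e][i])
    return H


def rank_by_worst_hazard(H, names):
    """Sort sounds in ascending order of the greatest hazard in their row."""
    worst = []
    for e, name in enumerate(names):
        worst.append((name, max(h for h in H[e] if h is not None)))
    return sorted(worst, key=lambda pair: pair[1])


# Made-up 3 x 3 example with placeholder sound names P, Q, R.
A = [[10.0, 2.0, 6.0], [1.0, 8.0, 4.0], [3.0, 5.0, 7.0]]
B = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]
ranking = rank_by_worst_hazard(hazard_matrix(A, B), ["P", "Q", "R"])
print(ranking)
```

A sound ranks as hazardous when some competing response is both close to the correct one on average and variable from sample to sample, which matches the intent of dividing deviations by the difference of averages.
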
A pleasant surprise was a finding of high consistency in the
first moment of the cochlear response to a given phoneme,
irrespective of pitch. This is illustrated in Table A6, in which the
maximum value, minimum value, average, and standard deviation of the
first moment for each phoneme are listed. Thus, even though the
FF sound has a nearly pure noise appearance on the oscilloscope,
it has a very well defined cochlear first moment value (75.6 ± 2.1).
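
The report does not give the formula used for the first moment. The Python sketch below shows one hypothetical definition that would at least behave like the tabulated values: a centroid of the 16 tap outputs, scaled to run from +100 (all energy at the tap nearest the stapes) to -100 (all energy at the farthest tap). The weighting, the scale, the sign convention, and the use of the population standard deviation for "Dev" are all assumptions made for illustration.

```python
import statistics


def first_moment(tap_outputs):
    """Hypothetical normalized first moment of one cochlear response.

    Weights run linearly from +1 (first tap, assumed nearest the stapes)
    to -1 (last tap); dividing by the total makes the result independent
    of overall signal level.
    """
    n = len(tap_outputs)
    total = sum(tap_outputs)
    if n < 2 or total == 0:
        raise ValueError("need at least two taps and a non-silent response")
    weights = [1.0 - 2.0 * k / (n - 1) for k in range(n)]
    return 100.0 * sum(w * v for w, v in zip(weights, tap_outputs)) / total


def summarize(moments):
    """Max, min, average, and (population) standard deviation."""
    return {
        "max": max(moments),
        "min": min(moments),
        "ave": statistics.mean(moments),
        "dev": statistics.pstdev(moments),
    }


# Toy responses: energy concentrated near the stapes versus far from it.
hiss_like = [8.0, 6.0, 3.0, 1.0] + [0.0] * 12
vowel_like = [0.0] * 12 + [1.0, 3.0, 6.0, 8.0]
print(first_moment(hiss_like), first_moment(vowel_like))
```

With a definition of this kind, hiss-like sounds such as SS, SH, and FF would fall at large positive values and sounds dominated by low frequencies at negative values, which is the pattern seen in Table A6.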



1.  Bolie,  V.  W.  "Computer  Optimization  of  Cochlear  Design 
Parameters,"  Tech.  Rep.,  USAFOSR  Grant  No.  72-2178,  Feb., 
[Equation panel. Legible fragments: P = 7*(N-K)/N; RV and RH, each
for all 1 ≤ K ≤ N; (Q*RH)/(2*pi*F(K)); (Q*RV)/(2*pi*F(K)).
Legend entries: total number of sections in analog cochlea; driving
voltage applied to input of first section; reference milliamperes
for I(K) membrane velocity; total propagation delay-time for
100-Hz input.]

Figure A2. Equations of the Analog Cochlea

[Plot: Pure-Tone Responses of the Cochlea; horizontal axis: Distance
from Stapes (mm)]

Figure A4. Computer-Coupled Artificial Ear

Table A1. International Phonetic Alphabet

   1.  [æ]   as in "bat"          21.  [h]   as in "he"
   2.  [e]   as in "ate"          22.  [s]   as in "see"
   3.  [ɛ]   as in "ten"          23.  [v]   as in "vote"
   4.  [i]   as in "beet"         24.  [z]   as in "zoo"
   5.  [ɪ]   as in "bit"          25.  [ʃ]   as in "shoe"
   6.  [ɑ]   as in "got"          26.  [θ]   as in "thin"
   7.  [o]   as in "go"           27.  [ð]   as in "then"
   8.  [ɔ]   as in "bawl"         28.  [ʒ]   as in "azure"
   9.  [u]   as in "boot"         29.  [dʒ]  as in "joy"
  10.  [ʊ]   as in "book"         30.  [tʃ]  as in "chew"
  11.  [ʌ]   as in "but"          31.  [b]   as in "bin"
  12.  [ɝ]   as in "burr"         32.  [d]   as in "did"
  13.  [l]   as in "let"          33.  [g]   as in "get"
  14.  [r]   as in "rat"          34.  [k]   as in "kill"
  15.  [w]   as in "wet"          35.  [p]   as in "put"
  16.  [j]   as in "you"          36.  [t]   as in "top"
  17.  [m]   as in "met"
  18.  [n]   as in "net"
  19.  [ŋ]   as in "sing"
  20.  [f]   as in "fall"

Table A2. Listing of the Prolongable Phonemes(a)
in the English Language

  Ident      Alpha
  Number     Description*

     1         OH
     2         EE
     3         EH
     4         SS
     5         AA
     6         ZZ
     7         AY
     8         FF
     9         OU
    10         LL
    11         KH
    12         AH
    13         RR
    14         OO
    15         AW
    16         SH
    17         VV
    18         II
    19         ZH
    20         UH
    21         NN

(a) Omitted because of out-of-context indistinguishability are TH (thin
vs fin), DH (this vs vis), and the MM and NG sounds (rum, run,
rung).

*The listed sequence of 21 sounds are those contained in the sentence

"Oh, yes, as a full car wash vision."


Table A3. Phoneme Structure of Typical Words

  Phoneme Sequence*            English Equivalent

  RR TH                        Earth
  SS UH NN                     Sun
  MM OO NN                     Moon
  RR II VV RR                  River
  OO SH UH NN                  Ocean
  SH AW RR                     Shore
  OO EH DH RR                  Weather
  SS UH NN EE                  Sunny
  RR AY NN                     Rain
  SS NN OH                     Snow
  AH EE SS                     Ice
  SS EE LL II NG               Ceiling
  NN AW RR TH                  North
  SS AH OO TH                  South
  LL EH VV LL                  Level
  AA ZZ II MM OU TH            Azimuth
  EH LL EH VV AY SH UH NN      Elevation
  RR AY NN ZH                  Range
  FF EE OO SS EH LL AH ZH      Fuselage
  OO II NG                     Wing
  AY LL RR AH NN               Aileron
  NN OH ZZ                     Nose
  KH AA NN UH NN               Cannon
  FF LL AA KH                  Flak
  MM AH EE NN                  Mine
  RR AH EE FF LL               Rifle
  SS LL II NG                  Sling
  EH RR OH                     Arrow
  RR UH SH EE UH               Russia
  FF RR AA NN SS               France
  II ZZ RR AY LL               Israel
  RR OH MM                     Rome
  KH AH EE RR OH               Cairo
  MM AH EE AA MM EE            Miami
  AH RR MM EE                  Army
  NN AY VV EE                  Navy
  MM UH RR EE NN               Marine
  EH RR                        Air
  FF AW RR SS                  Force
  MM II LL II SH UH            Militia
  RR OH LL                     Roll
  SS KH OO EE ZZ               Squeeze
  KH RR UH SH                  Crush
  SH UH VV                     Shove
  SS OO KH                     Soak
  TH RR OH                     Throw

*The equalities TH = FF, DH = VV, and MM = NG = NN are made
automatically as this 2-column dictionary is loaded into
memory.
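
The substitution applied while loading the dictionary can be pictured with a short Python sketch. It is illustrative only: the original ran on the HP-9830, and the names `EQUIVALENTS` and `load_dictionary` are hypothetical. The merge direction (TH to FF, DH to VV, MM and NG to NN) is assumed from the fact that FF, VV, and NN are the members of each group that appear in the 21-sound list of Table A2.

```python
# Sounds outside the 21-phoneme set, mapped to their assumed stand-ins.
EQUIVALENTS = {"TH": "FF", "DH": "VV", "MM": "NN", "NG": "NN"}


def load_dictionary(entries):
    """Build {phoneme tuple: [words]} from (sequence string, word) pairs.

    Each two-letter code is replaced by its equivalent as the entry is
    loaded; a list of words is kept per key because merging sounds can
    make different words share one phoneme sequence.
    """
    table = {}
    for sequence, word in entries:
        key = tuple(EQUIVALENTS.get(p, p) for p in sequence.split())
        table.setdefault(key, []).append(word)
    return table


# A few rows taken from Table A3.
rows = [("RR TH", "Earth"), ("MM OO NN", "Moon"), ("OO II NG", "Wing"),
        ("OO EH DH RR", "Weather")]
dictionary = load_dictionary(rows)
print(dictionary[("RR", "FF")])
```

After loading, every key is written entirely in the 21 recognizable sounds, so a recognized phoneme string can be looked up directly.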

Table A4. Hazard Ranking* of Perceptions

                               Errors in 12
  Rank    Sound    Hazard      Challenges

    1      UH      0.109           0
    2      KH      0.133           0
    3      SH      0.143           0
    4      SS      0.149           0
    5      FF      0.191           0
    6      AH      0.233           0
    7      AA      0.242           0
    8      OH      0.245           0
    9      AY      0.248           0
   10      II      0.258           0
   11      AW      0.267           0
   12      ZZ      0.272           0
   13      EH      0.283           0
   14      ZH      0.325           0
   15      OO      0.350           1
   16      NN      0.381           0
   17      EE      0.388           0
   18      VV      0.467           0
   19      RR      0.847           0
   20      OU      1.024           1
   21      LL      3.600           5

*Hazard matrix elements range from 0.058 to 3.600.

Table A5. Nature of the Perception Errors

       Challenge               Response Choices
   E*    K**   Sound      1st   2nd   3rd   4th   5th

    9     1     OU        VV    OU    RR    AA    ZZ
   10     1     LL        OU    VV    LL    RR    ZZ
   10     5     LL        OU    LL    RR    VV    ZZ
   10     6     LL        OU    LL    RR    VV    ZZ
   10     7     LL        OU    LL    RR    VV    ZZ
   10     8     LL        OU    LL    RR    VV    ZZ
   14     1     OO        NN    OO    AH    OU    AW

*The original numerical code for the challenge sound is E.

**The sample number of the challenge sound is K.


Table A6. Cochlear First-Moment Statistics

  Phoneme      Max       Min       Ave      Dev

    OH       -63.84    -77.42    -73.59     2.09
    EE       -30.14    -48.38    -39.78     3.78
    EH        22.23      8.43     15.11     4.41
    SS        91.84     81.91     86.54     2.96
    AA        13.09     -4.63      4.32     4.32
    ZZ       -14.96    -46.52    -33.70     5.68
    AY        10.70    -12.50     -1.93     7.11
    FF        79.42     70.25     75.64     2.10
    OU       -50.50    -71.35    -64.38     4.21
    LL       -69.38    -83.98    -77.20     3.42
    KH        71.58     39.63     63.40     7.61
    AH        -2.44    -23.05    -13.28     5.73
    RR       -26.34    -56.40    -38.98     9.25
    OO       -76.40    -93.03    -91.05     2.62
    AW       -38.10    -57.23    -46.11     4.06
    SH        92.80     87.03     90.74     1.24
    VV       -48.75    -76.19    -63.08     5.34
    II         9.92    -29.11    -11.95    11.51
    ZH        -2.63    -39.76    -27.73     9.47
    UH       -30.14    -55.07    -38.05     6.53
    NN       -65.47    -88.90    -83.71     4.21